This chapter examines bidirectional relationships between variables, including the explicit linkage between basketball performance and betting industry valuations (Model 5: DKNG ~ Attendance + ORtg). By modeling how NBA metrics influence betting stocks, we test whether the on-court analytics revolution created measurable financial value in connected industries.
Understanding NBA offensive efficiency requires a framework that separates team performance from game tempo. Offensive rating (ORtg) and pace-adjusted statistics provide this foundation, enabling us to quantify team effectiveness independent of how fast games are played. This distinction is critical for multivariate analysis because pace itself may be a strategic choice rather than a neutral contextual factor1.
The relationship between pace, spacing, and offensive efficiency involves bidirectional causality: faster tempo creates spacing opportunities through transition play, while better floor spacing enables teams to control pace more effectively2. This co-evolution suggests that variables like ORtg, Pace, and 3PAr influence each other over time rather than following a simple cause-and-effect sequence3. Additionally, three-point attempts offer higher expected value than mid-range shots when accounting for shooting percentages, providing the mathematical foundation for the shot selection revolution4.
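The expected-value argument is simple arithmetic; with illustrative (not dataset-derived) shooting percentages of 36% from three and 42% on mid-range twos:

\[
E[\text{3PT}] = 3 \times 0.36 = 1.08 \ \text{points per shot} \;>\; E[\text{mid-range 2PT}] = 2 \times 0.42 = 0.84 \ \text{points per shot},
\]

a gap of roughly 0.24 points per attempt before free throws and offensive rebounds are considered.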
Recent work reveals that teams with higher True Shooting percentage subsequently increased their three-point volume, suggesting reverse causation: success breeds strategy changes. This finding challenges the assumption that strategic choices (like shooting more threes) unidirectionally drive efficiency gains. Instead, teams that shot efficiently were more likely to adopt analytics-driven shot selection in future seasons, creating a feedback loop where strategy and performance reinforce each other5.
These dynamics motivate our multivariate approach. ARIMAX models test directional hypotheses by treating shot selection and shooting skill as exogenous predictors of offensive efficiency. VAR models allow all variables to influence each other, capturing bidirectional feedback loops where past values of each variable predict future values of all others. Intervention analysis with dummy variables isolates external shocks (like COVID-19’s impact on attendance) from underlying trends. Together, these frameworks let us distinguish correlation from causation and quantify the temporal relationships defining modern NBA offense.
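In schematic form (standard formulations, with \(y_t\) a single response, \(x_t\) exogenous regressors, \(\mathbf{y}_t\) the vector of jointly modeled series, and \(D_t\) a 0/1 intervention dummy), the three frameworks are:

\[
\text{ARIMAX:}\quad y_t = \beta^{\top} x_t + \eta_t, \qquad \phi(B)(1-B)^d \eta_t = \theta(B)\epsilon_t
\]
\[
\text{VAR}(p):\quad \mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\epsilon}_t
\]
\[
\text{Intervention:}\quad y_t = \omega D_t + \eta_t,
\]

where \(D_t = 1\) during the shock period and 0 otherwise.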
Note: Models 4 and 5 still need to be completed for the final portfolio.
This model addresses whether shooting more 3s and shooting accuracy explain offensive efficiency gains. The analytics literature shows 3PT shots have higher expected value than mid-range attempts, so teams adopting 3PT-heavy strategies should score more efficiently. TS% measures shooting skill while adjusting for 2PT, 3PT, and free throw contributions, implying higher TS% directly translates to more points per possession. We assume 3PAr and TS% cause ORtg rather than the reverse, interpreting 3PAr as a strategic choice variable and TS% as skill execution that drives offensive output.
ts_ortg <- ts(league_avg$ORtg, start = 1980, frequency = 1)
ts_3par <- ts(league_avg$`3PAr`, start = 1980, frequency = 1)
ts_tspct <- ts(league_avg$`TS%`, start = 1980, frequency = 1)

p1 <- autoplot(ts_ortg) + labs(title = "Offensive Rating (ORtg)", y = "ORtg") + theme_minimal()
p2 <- autoplot(ts_3par) + labs(title = "3-Point Attempt Rate (3PAr)", y = "3PAr") + theme_minimal()
p3 <- autoplot(ts_tspct) + labs(title = "True Shooting % (TS%)", y = "TS%") + theme_minimal()
p1 / p2 / p3
The time series reveal basketball’s transformation at a glance: ORtg climbs gradually from ~104 in 1980 to ~113 in 2025, while 3PAr explodes after 2012 (the analytics inflection point), accelerating from ~25% to over 40% of all shot attempts. True Shooting percentage rises steadily, suggesting shooting skill improved alongside strategic changes: players got better as teams got smarter. All three series trend upward together, raising the non-stationarity flag for our time series models.
The correlations tell a clear story. ORtg vs 3PAr shows a strong positive relationship: shooting more threes correlates with better offense. ORtg vs TS% displays an even stronger positive relationship, suggesting shooting accuracy matters even more. 3PAr vs TS% shows a moderate positive correlation, indicating that teams shooting more threes also shoot better overall, likely due to selection effects where better shooters attempt more threes. Both predictors correlate strongly with offensive efficiency, but TS% shows the stronger association. The lesson: strategy matters, but execution matters more.
Series: ts_ortg
Regression with ARIMA(1,0,0) errors
Coefficients:
ar1 3PAr TS%
0.6668 -3.4929 200.0000
s.e. 0.1149 2.0492 0.9012
sigma^2 = 0.3911: log likelihood = -41.47
AIC=90.94 AICc=91.94 BIC=98.17
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.03137575 0.6041491 0.4719296 0.02673679 0.4387345 0.4198899
ACF1
Training set -0.1741445
Model Diagnostics:
Code
checkresiduals(arimax_auto)
Ljung-Box test
data: Residuals from Regression with ARIMA(1,0,0) errors
Q* = 10.777, df = 8, p-value = 0.2147
Model df: 1. Total lags used: 9
Code
# Ljung-Box test
ljung_auto <- Box.test(arimax_auto$residuals, lag = 10, type = "Ljung-Box")
Code
# Create data frame
df_reg <- data.frame(
  ORtg = as.numeric(ts_ortg),
  PAr3 = as.numeric(ts_3par),
  TSpct = as.numeric(ts_tspct)
)

# Fit regression
lm_model <- lm(ORtg ~ PAr3 + TSpct, data = df_reg)
cat("Linear Regression Results:\n")
Linear Regression Results:
Code
summary(lm_model)
Call:
lm(formula = ORtg ~ PAr3 + TSpct, data = df_reg)
Residuals:
Min 1Q Median 3Q Max
-1.5855 -0.4763 -0.0837 0.5999 2.0563
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.994 5.123 1.365 0.1794
PAr3 -3.258 1.364 -2.390 0.0214 *
TSpct 187.046 9.790 19.106 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8024 on 42 degrees of freedom
Multiple R-squared: 0.9284, Adjusted R-squared: 0.925
F-statistic: 272.5 on 2 and 42 DF, p-value: < 2.2e-16
The regression equation reveals the mathematical relationship:
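From the coefficient table above, the fitted relationship (coefficients rounded) is:

\[
\widehat{\text{ORtg}} = 6.99 \;-\; 3.26\,\text{3PAr} \;+\; 187.05\,\text{TS\%}
\]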
Here’s what this means on the court. Both predictors are proportions, so the raw coefficients apply to a full-unit (100 percentage point) change and must be rescaled by 0.01 for a one-point move: a 1 percentage point increase in 3PAr (moving from 30% to 31% of shots being threes) is associated with a *decrease* of about 0.03 points per 100 possessions, while a 1 percentage point increase in TS% (improving from 55% to 56% True Shooting) adds about 1.87 points. The negative 3PAr coefficient is counterintuitive given the positive bivariate correlation; once TS% is held fixed, extra three-point volume adds little, a sign of collinearity between the predictors. The headline result stands: shooting accuracy has a far stronger impact than shot selection.
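Because the rescaling is easy to get wrong, here is a minimal R check of the per-percentage-point effects, using the coefficients copied from the regression output above:

```r
# Coefficients from the lm() summary above
b_3par  <- -3.258    # effect of a full-unit change in 3PAr (i.e., 0 -> 100%)
b_tspct <- 187.046   # effect of a full-unit change in TS%

# Both predictors are proportions, so 1 percentage point = 0.01 units
effect_3par_1pp  <- b_3par  * 0.01   # about -0.033 ORtg points
effect_tspct_1pp <- b_tspct * 0.01   # about +1.87 ORtg points

round(c(`3PAr (per pp)` = effect_3par_1pp, `TS% (per pp)` = effect_tspct_1pp), 3)
```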
Code
lm_residuals <- ts(residuals(lm_model), start = 1980, frequency = 1)

# Plot residuals
autoplot(lm_residuals) + labs(title = "Regression Residuals", y = "Residuals") + theme_minimal()
Code
# Fit ARIMA to residuals
arima_resid <- auto.arima(lm_residuals, seasonal = FALSE)
cat("\nARIMA model for residuals:\n")
ARIMA model for residuals:
Code
print(arima_resid)
Series: lm_residuals
ARIMA(2,0,0) with zero mean
Coefficients:
ar1 ar2
0.5151 0.2446
s.e. 0.1459 0.1502
sigma^2 = 0.3491: log likelihood = -39.53
AIC=85.05 AICc=85.64 BIC=90.47
Code
# Diagnostics
checkresiduals(arima_resid)
Ljung-Box test
data: Residuals from ARIMA(2,0,0) with zero mean
Q* = 10.413, df = 7, p-value = 0.1664
Model df: 2. Total lags used: 9
# Select best model based on CV
if (cv_results$Model[best_idx] == "ARIMAX (auto.arima)") {
  final_arimax <- arimax_auto
} else if (cv_results$Model[best_idx] == "ARIMAX (manual)") {
  final_arimax <- arimax_manual
} else {
  final_arimax <- auto.arima(ts_ortg, seasonal = FALSE)
}
What Cross-Validation Reveals
The winner is the plain ARIMA model with RMSE ≈ 1.4: in cross-validation, the exogenous variables did not add out-of-sample predictive power.
The analytics revolution is still quantifiable in-sample, where shot selection and shooting skill track efficiency closely, but their forecasting value is largely absorbed by ORtg’s own trend.
Code
# Refit winning model on full data
if ("xreg" %in% names(final_arimax$call)) {
  final_fit <- Arima(
    ts_ortg,
    order = arimaorder(final_arimax)[1:3],
    xreg = xreg_matrix
  )
} else {
  final_fit <- Arima(ts_ortg, order = arimaorder(final_arimax)[1:3])
}
print(summary(final_fit))
Series: ts_ortg
ARIMA(0,1,0)
sigma^2 = 2.025: log likelihood = -77.95
AIC=157.9 AICc=157.99 BIC=159.68
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.2030613 1.407018 1.101305 0.1760845 1.028496 0.9798637
ACF1
Training set -0.2573027
Code
# Also fit ARIMAX model for demonstration
cat("\n\n=== For Comparison: ARIMAX Model with Exogenous Variables ===\n")
=== For Comparison: ARIMAX Model with Exogenous Variables ===
Series: ts_ortg
Regression with ARIMA(1,0,0) errors
Coefficients:
ar1 intercept 3PAr TS%
0.6680 8.7816 -1.9290 183.2426
s.e. 0.1157 6.7908 2.3704 12.9989
sigma^2 = 0.3861: log likelihood = -40.64
AIC=91.28 AICc=92.82 BIC=100.31
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.0240792 0.5931024 0.472989 0.0181601 0.4400635 0.4208324
ACF1
Training set -0.1807397
Model Equations:
Winning Model (Selected by Cross-Validation)
ARIMA(0,1,0) model:
\[
(1 - B) \text{ORtg}_t = \epsilon_t
\]
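For an ARIMA(0,1,0) (a random walk), the h-step-ahead forecast is flat and uncertainty accumulates with the horizon:

\[
\hat{\text{ORtg}}_{T+h \mid T} = \text{ORtg}_T, \qquad \operatorname{Var}\left(\text{ORtg}_{T+h} \mid \text{ORtg}_T\right) = h\,\sigma^2,
\]

which is why the point forecasts below are constant while the prediction intervals widen roughly like \(\sqrt{h}\).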
Code
if ("xreg" %in% names(final_fit$call)) {
  # Forecast 3PAr and TS%
  fc_3par <- forecast(auto.arima(ts_3par), h = 5)
  fc_tspct <- forecast(auto.arima(ts_tspct), h = 5)

  # Create future xreg matrix
  xreg_future <- cbind(`3PAr` = fc_3par$mean, `TS%` = fc_tspct$mean)

  # Forecast ORtg
  fc_final <- forecast(final_fit, xreg = xreg_future, h = 5)
} else {
  fc_final <- forecast(final_fit, h = 5)
}

# Plot forecast (note: the original subtitle called paste0(final_fit), which
# coerces the whole model object to text, so the subtitle now reports only the inputs)
autoplot(fc_final) +
  labs(
    title = "ORtg Forecast: 2026-2030",
    subtitle = "Using forecasted 3PAr and TS% as exogenous inputs where applicable",
    x = "Year",
    y = "Offensive Rating (Points per 100 Possessions)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.subtitle = element_text(size = 9))
Code
cat("\nPoint Forecasts:\n")
Point Forecasts:
Code
print(fc_final$mean)
Time Series:
Start = 2025
End = 2029
Frequency = 1
[1] 114.5323 114.5323 114.5323 114.5323 114.5323
Code
cat("\n\n80% Prediction Interval:\n")
80% Prediction Interval:
Code
print(fc_final$lower[, 1])
Time Series:
Start = 2025
End = 2029
Frequency = 1
[1] 112.7087 111.9534 111.3738 110.8852 110.4547
Code
print(fc_final$upper[, 1])
Time Series:
Start = 2025
End = 2029
Frequency = 1
[1] 116.3558 117.1111 117.6907 118.1793 118.6098
The ARIMAX model tells a compelling story about modern basketball’s transformation.
The Analytics Advantage
Interestingly, adding exogenous variables did not improve forecast accuracy in cross-validation. This likely reflects strong collinearity among the predictors and their shared upward trend: 3PAr and TS% move together with ORtg, so they carry little incremental forecasting information beyond ORtg’s own history.
Forecast Performance
The model achieved a test RMSE of 1.4 points per 100 possessions, corresponding to approximately 1.1% average error. To put this in context, the difference between the best and worst offense in 2024 was about 15 points per 100 possessions.
What the Future Holds
Forecasts rely purely on historical ORtg patterns, capturing the league’s 45-year trajectory toward more efficient offense. The trend is clear, but the mechanism remains in the residuals.
The Basketball Insight
Here’s what matters for teams: offensive efficiency isn’t magic, it’s a function of where you shoot (3PAr) and how well you shoot (TS%). The model equation makes this quantitative:
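One way to make that quantitative is the comparison ARIMAX fit reported earlier (coefficients rounded):

\[
\text{ORtg}_t = 8.78 \;-\; 1.93\,\text{3PAr}_t \;+\; 183.24\,\text{TS\%}_t \;+\; \eta_t, \qquad \eta_t = 0.668\,\eta_{t-1} + \epsilon_t
\]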
Model 4 examines attendance (Attendance ~ ORtg + Pace + COVID_Dummy) and includes three key variables: ORtg tests whether better offensive performance leads to more entertaining games and higher attendance; Pace reflects whether faster games attract more fans; and the COVID_Dummy captures the structural break in 2020-2021 from empty arenas and capacity restrictions.
league_post2000 <- league_avg %>% filter(Season >= 2000)

ts_attend <- ts(league_post2000$Total_Attendance, start = 2000, frequency = 1)
ts_ortg_sub <- ts(league_post2000$ORtg, start = 2000, frequency = 1)
ts_pace_sub <- ts(league_post2000$Pace, start = 2000, frequency = 1)
ts_covid <- ts(league_post2000$COVID_Dummy, start = 2000, frequency = 1)

p1 <- autoplot(ts_attend / 1e6) +
  labs(title = "Total NBA Attendance", y = "Attendance (Millions)") +
  geom_vline(xintercept = 2020, linetype = "dashed", color = "red") +
  annotate("text", x = 2020, y = 23, label = "COVID-19", color = "red", hjust = -0.1) +
  theme_minimal()
p2 <- autoplot(ts_ortg_sub) + labs(title = "Offensive Rating", y = "ORtg") + theme_minimal()
p3 <- autoplot(ts_pace_sub) + labs(title = "Pace", y = "Pace") + theme_minimal()
p1 / (p2 | p3)
The attendance plot tells a stark before-and-after story: from 2000-2019, the NBA showed remarkable stability around 22 million attendees per season as a reliable entertainment product. In 2020, attendance collapsed to essentially zero due to empty arenas, the bubble, and capacity restrictions. From 2021-2025, gradual recovery began but with visible scars. Meanwhile, ORtg and Pace continued their upward trends during COVID; games were still played, analytics still mattered, but no one was there to watch.
Call:
lm(formula = Attendance ~ ORtg + Pace + COVID, data = df_attend)
Residuals:
Min 1Q Median 3Q Max
-7856805 -474084 116291 715580 7856805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1342578 16854399 0.080 0.937
ORtg -113325 280214 -0.404 0.690
Pace 348598 311438 1.119 0.275
COVID -13793998 2208635 -6.245 2.76e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2617000 on 22 degrees of freedom
Multiple R-squared: 0.658, Adjusted R-squared: 0.6114
F-statistic: 14.11 on 3 and 22 DF, p-value: 2.398e-05
The regression reveals COVID’s devastating impact in stark numerical terms:
\[
\beta_{\text{COVID}} = -13,793,998
\]
The pandemic reduced attendance by approximately 13.8 million attendees per affected season during 2020-2021. For context, total pre-pandemic attendance was around 22 million per season, so the dummy absorbs the loss of most of a normal season’s crowd, with 2020 itself a near-complete collapse.
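As a sanity check on the magnitude (illustrative arithmetic only, using the coefficient from the regression above and a rough 22 million pre-pandemic baseline):

```r
beta_covid <- -13793998   # COVID dummy coefficient from the attendance regression
pre_pandemic <- 22e6      # approximate pre-pandemic attendance per season

# Share of a typical season's attendance lost per affected season
round(abs(beta_covid) / pre_pandemic, 2)  # ~0.63, i.e., about 63%
```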
ARIMA model for residuals:
Series: resid_attend
ARIMA(0,0,1) with zero mean
Coefficients:
ma1
-0.5235
s.e. 0.2180
sigma^2 = 4.872e+12: log likelihood = -416.33
AIC=836.67 AICc=837.19 BIC=839.18
best_idx_att <- which.min(cv_att_results$RMSE)
cat("\n*** BEST MODEL: ", cv_att_results$Model[best_idx_att], "***\n")
*** BEST MODEL: ARIMA ***
When fitted on full data (2000-2025), the COVID dummy captures the unprecedented shock effectively.
Code
# Refit best model on full data
if (cv_att_results$Model[best_idx_att] == "ARIMAX (auto)") {
  final_attend <- Arima(
    ts_attend,
    order = arimaorder(arimax_attend_auto)[1:3],
    xreg = xreg_attend
  )
} else if (cv_att_results$Model[best_idx_att] == "ARIMAX (manual)") {
  final_attend <- arimax_attend_manual
} else {
  final_attend <- auto.arima(ts_attend, seasonal = FALSE)
}
cat("Final Attendance Model:\n")
Final Attendance Model:
Code
print(summary(final_attend))
Series: ts_attend
ARIMA(0,0,0) with non-zero mean
Coefficients:
mean
20953026.5
s.e. 807373.2
sigma^2 = 1.763e+13: log likelihood = -432.89
AIC=869.78 AICc=870.3 BIC=872.29
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -6.196877e-09 4116988 2045563 -45.69258 54.75919 0.9069228
ACF1
Training set 0.163378
Code
# Forecast exogenous variables
fc_ortg_fut <- forecast(auto.arima(ts_ortg_sub), h = 5)
fc_pace_fut <- forecast(auto.arima(ts_pace_sub), h = 5)

# Create xreg based on what variables the model has
if ("COVID" %in% names(coef(final_attend))) {
  xreg_future_att <- cbind(
    ORtg = fc_ortg_fut$mean,
    Pace = fc_pace_fut$mean,
    COVID = rep(0, 5)
  )
} else {
  xreg_future_att <- cbind(ORtg = fc_ortg_fut$mean, Pace = fc_pace_fut$mean)
}

fc_attend_final <- forecast(final_attend, xreg = xreg_future_att, h = 5)

autoplot(fc_attend_final) +
  labs(
    title = "NBA Attendance Forecast: 2026-2030",
    subtitle = "Assumes full COVID recovery (COVID_Dummy = 0)",
    x = "Year",
    y = "Total Attendance (Millions)"
  ) +
  scale_y_continuous(labels = function(x) x / 1e6) +
  theme_minimal()
Here’s where cross-validation reveals a hard truth about forecasting: the COVID dummy was all zeros in our training data (2000-2018) because the pandemic didn’t exist yet. The model couldn’t learn what it hadn’t seen.
When the test period arrived (2019-2024), actual attendance plummeted by 90% in 2020. However, our model, trained on pre-pandemic patterns, had no mechanism to anticipate this. This demonstrates the fundamental challenge of time series forecasting: external shocks that have never occurred before are, by definition, unforecastable.
The model relies purely on time-series patterns; the specification selected here simply reverts toward the long-run average rather than reacting to on-court performance metrics. This makes sense: season ticket holders commit months in advance, and casual fans follow habits more than real-time performance.
VAR (Vector Autoregression) Models
Efficiency Drivers (VAR)
Variables: ORtg, Pace, 3PAr
The theoretical rationale for this VAR model involves multiple bidirectional relationships: faster tempo (Pace) creates more transition opportunities favoring quick 3PT attempts (3PAr), while teams shooting more 3s may adopt faster pace to maximize possessions. Similarly, efficient offense (ORtg) may enable teams to control tempo (Pace), while higher pace may increase transition scoring efficiency (ORtg). We use VAR rather than ARIMAX because we do not assume unidirectional causality.
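A VAR(1) for these three series writes each variable as a linear function of every variable’s previous value, so no directionality is assumed up front:

\[
\begin{bmatrix} \text{ORtg}_t \\ \text{Pace}_t \\ \text{3PAr}_t \end{bmatrix}
= \mathbf{c} + A_1 \begin{bmatrix} \text{ORtg}_{t-1} \\ \text{Pace}_{t-1} \\ \text{3PAr}_{t-1} \end{bmatrix}
+ \boldsymbol{\epsilon}_t,
\]

where \(A_1\) is a 3×3 coefficient matrix whose off-diagonal entries carry the cross-variable feedback.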
# Create VAR dataset
var_data <- ts(
  league_avg %>% dplyr::select(ORtg, Pace, `3PAr`),
  start = 1980, frequency = 1
)

# Plot all series
autoplot(var_data, facets = TRUE) +
  labs(
    title = "VAR Variables: ORtg, Pace, 3PAr (1980-2025)",
    x = "Year", y = "Value"
  ) +
  theme_minimal()
Code
cat("Summary Statistics:\n")
Summary Statistics:
Code
summary(var_data)
ORtg Pace 3PAr
Min. :102.2 Min. : 88.92 Min. :0.02292
1st Qu.:105.8 1st Qu.: 91.81 1st Qu.:0.08761
Median :107.5 Median : 95.77 Median :0.19584
Mean :107.5 Mean : 95.65 Mean :0.19578
3rd Qu.:108.2 3rd Qu.: 99.18 3rd Qu.:0.25958
Max. :115.3 Max. :103.06 Max. :0.42119
Code
# ADF tests for each series
cat("=== STATIONARITY TESTS ===\n\n")
# Fit models with different lag orders
lags_to_fit <- unique(var_select$selection[1:3])

# Ensure we have at least one lag to fit
if (length(lags_to_fit) == 0 || any(is.na(lags_to_fit))) {
  cat("Warning: VARselect returned no valid lags. Defaulting to p = 1\n")
  lags_to_fit <- 1
}
cat("Fitting VAR models with p =", paste(lags_to_fit, collapse = ", "), "\n\n")
Fitting VAR models with p = 1
Code
var_models <- list()
for (p in lags_to_fit) {
  var_models[[paste0("VAR_", p)]] <- VAR(var_data_final, p = p, type = "const")
  cat("VAR(", p, ") fitted successfully\n", sep = "")
}
VAR(1) fitted successfully
Code
for (name in names(var_models)) {
  cat("========================================\n")
  cat(name, "Summary:\n")
  cat("========================================\n\n")
  print(summary(var_models[[name]]))
  cat("\n\n")
}
========================================
VAR_1 Summary:
========================================
VAR Estimation Results:
=========================
Endogenous variables: ORtg, Pace, X3PAr
Deterministic variables: const
Sample size: 43
Log Likelihood: -24.745
Roots of the characteristic polynomial:
0.3236 0.1471 0.1355
Call:
VAR(y = var_data_final, p = p, type = "const")
Estimation results for equation ORtg:
=====================================
ORtg = ORtg.l1 + Pace.l1 + X3PAr.l1 + const
Estimate Std. Error t value Pr(>|t|)
ORtg.l1 -0.38350 0.15583 -2.461 0.0184 *
Pace.l1 0.17639 0.15459 1.141 0.2608
X3PAr.l1 29.14282 14.02867 2.077 0.0444 *
const 0.02675 0.23554 0.114 0.9102
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.349 on 39 degrees of freedom
Multiple R-Squared: 0.1733, Adjusted R-squared: 0.1097
F-statistic: 2.724 on 3 and 39 DF, p-value: 0.05723
Estimation results for equation Pace:
=====================================
Pace = ORtg.l1 + Pace.l1 + X3PAr.l1 + const
Estimate Std. Error t value Pr(>|t|)
ORtg.l1 -0.03026 0.16414 -0.184 0.855
Pace.l1 -0.11711 0.16283 -0.719 0.476
X3PAr.l1 6.57227 14.77673 0.445 0.659
const -0.10617 0.24810 -0.428 0.671
Residual standard error: 1.421 on 39 degrees of freedom
Multiple R-Squared: 0.02201, Adjusted R-squared: -0.05322
F-statistic: 0.2925 on 3 and 39 DF, p-value: 0.8305
Estimation results for equation X3PAr:
======================================
X3PAr = ORtg.l1 + Pace.l1 + X3PAr.l1 + const
Estimate Std. Error t value Pr(>|t|)
ORtg.l1 -0.0008257 0.0018756 -0.440 0.6622
Pace.l1 0.0011681 0.0018606 0.628 0.5338
X3PAr.l1 0.1654261 0.1688517 0.980 0.3333
const 0.0080410 0.0028350 2.836 0.0072 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01623 on 39 degrees of freedom
Multiple R-Squared: 0.02972, Adjusted R-squared: -0.04492
F-statistic: 0.3982 on 3 and 39 DF, p-value: 0.7551
Covariance matrix of residuals:
ORtg Pace X3PAr
ORtg 1.818853 0.407221 0.0052772
Pace 0.407221 2.018000 -0.0020829
X3PAr 0.005277 -0.002083 0.0002635
Correlation matrix of residuals:
ORtg Pace X3PAr
ORtg 1.0000 0.21255 0.24106
Pace 0.2126 1.00000 -0.09033
X3PAr 0.2411 -0.09033 1.00000
Code
cat("=== TIME SERIES CROSS-VALIDATION FOR VAR ===\n\n")
=== TIME SERIES CROSS-VALIDATION FOR VAR ===
Code
# Split: Train 1980-2019, Test 2020-2024
train_end_var <- 2019
if (differenced) {
  # Differenced data starts at 1981 (lost 1980 due to differencing)
  # Training: 1981-2019, Test: 2020-2024
  train_var <- window(var_data_final, end = train_end_var)
  test_var <- window(var_data_final, start = train_end_var + 1)
} else {
  train_var <- window(var_data_final, end = train_end_var)
  test_var <- window(var_data_final, start = train_end_var + 1)
}
h_var <- nrow(test_var)

# Fit VAR models on training data with error handling
var_train_models <- list()
for (p in lags_to_fit) {
  model <- tryCatch(
    {
      VAR(train_var, p = p, type = "const")
    },
    error = function(e) {
      cat("Warning: VAR(", p, ") failed. Trying with smaller lag...\n", sep = "")
      if (p > 1) {
        VAR(train_var, p = 1, type = "const")
      } else {
        NULL
      }
    }
  )
  if (!is.null(model)) {
    var_train_models[[paste0("VAR_", p)]] <- model
  }
}

# Generate forecasts
rmse_results <- data.frame()
cat("Number of VAR models to evaluate:", length(var_train_models), "\n")
Number of VAR models to evaluate: 1
Code
if (length(var_train_models) == 0) {
  cat("ERROR: No VAR models were successfully fitted. Check data and lag selection.\n")
  cat("  Lags attempted:", paste(lags_to_fit, collapse = ", "), "\n")
  cat("  Train data dimensions:", nrow(train_var), "rows x", ncol(train_var), "cols\n")
} else {
  cat(
    "Successfully fitted", length(var_train_models), "VAR model(s):",
    paste(names(var_train_models), collapse = ", "), "\n\n"
  )
}
Successfully fitted 1 VAR model(s): VAR_1
Code
for (name in names(var_train_models)) {
  fc <- tryCatch(
    {
      predict(var_train_models[[name]], n.ahead = h_var)
    },
    error = function(e) {
      cat("Warning: Forecast failed for", name, ":", e$message, "\n")
      NULL
    }
  )
  if (is.null(fc)) next

  # Extract forecasts for each variable - handle different column name formats
  fc_var_names <- names(fc$fcst)
  fc_ortg <- fc$fcst$ORtg[, "fcst"]
  fc_pace <- fc$fcst$Pace[, "fcst"]

  # Find 3PAr (might be stored with backticks or without)
  tpar_fc_name <- fc_var_names[grep("3PAr|PAr", fc_var_names, ignore.case = TRUE)]
  if (length(tpar_fc_name) == 0) {
    tpar_fc_name <- fc_var_names[3]  # Default to third variable
  } else {
    tpar_fc_name <- tpar_fc_name[1]  # Take first match
  }
  fc_3par <- fc$fcst[[tpar_fc_name]][, "fcst"]

  # Convert test data to numeric vectors for comparison, handling column names similarly
  test_var_names <- colnames(test_var)
  test_ortg_vec <- as.numeric(test_var[, "ORtg"])
  test_pace_vec <- as.numeric(test_var[, "Pace"])
  tpar_test_name <- test_var_names[grep("3PAr|PAr", test_var_names, ignore.case = TRUE)]
  if (length(tpar_test_name) == 0) {
    tpar_test_name <- test_var_names[3]
  } else {
    tpar_test_name <- tpar_test_name[1]
  }
  test_3par_vec <- as.numeric(test_var[, tpar_test_name])

  # Ensure equal lengths (forecasts might be shorter if h_var is large)
  n_compare <- min(length(test_ortg_vec), length(fc_ortg), length(fc_pace), length(fc_3par))

  # Calculate RMSE for each variable
  rmse_ortg <- sqrt(mean((test_ortg_vec[1:n_compare] - fc_ortg[1:n_compare])^2, na.rm = TRUE))
  rmse_pace <- sqrt(mean((test_pace_vec[1:n_compare] - fc_pace[1:n_compare])^2, na.rm = TRUE))
  rmse_3par <- sqrt(mean((test_3par_vec[1:n_compare] - fc_3par[1:n_compare])^2, na.rm = TRUE))

  # Average RMSE across variables
  rmse_avg <- mean(c(rmse_ortg, rmse_pace, rmse_3par), na.rm = TRUE)

  cat(
    "  ", name, ": ORtg RMSE =", round(rmse_ortg, 4),
    "| Pace RMSE =", round(rmse_pace, 4),
    "| 3PAr RMSE =", round(rmse_3par, 4),
    "| Avg =", round(rmse_avg, 4), "\n"
  )

  # Create descriptive model name
  lag_num <- as.numeric(gsub("VAR_", "", name))
  model_label <- paste0("VAR(", lag_num, ")")
  rmse_results <- rbind(rmse_results, data.frame(
    Model = model_label,
    Lags = lag_num,
    RMSE_ORtg = rmse_ortg,
    RMSE_Pace = rmse_pace,
    RMSE_3PAr = rmse_3par,
    RMSE_Avg = rmse_avg
  ))
}
# Refit on full data
final_var <- tryCatch(
  {
    VAR(var_data_final, p = best_var_lags, type = "const")
  },
  error = function(e) {
    VAR(var_data_final, p = 1, type = "const")
  }
)
print(summary(final_var))
VAR Estimation Results:
=========================
Endogenous variables: ORtg, Pace, X3PAr
Deterministic variables: const
Sample size: 43
Log Likelihood: -24.745
Roots of the characteristic polynomial:
0.3236 0.1471 0.1355
Call:
VAR(y = var_data_final, p = best_var_lags, type = "const")
Estimation results for equation ORtg:
=====================================
ORtg = ORtg.l1 + Pace.l1 + X3PAr.l1 + const
Estimate Std. Error t value Pr(>|t|)
ORtg.l1 -0.38350 0.15583 -2.461 0.0184 *
Pace.l1 0.17639 0.15459 1.141 0.2608
X3PAr.l1 29.14282 14.02867 2.077 0.0444 *
const 0.02675 0.23554 0.114 0.9102
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.349 on 39 degrees of freedom
Multiple R-Squared: 0.1733, Adjusted R-squared: 0.1097
F-statistic: 2.724 on 3 and 39 DF, p-value: 0.05723
Estimation results for equation Pace:
=====================================
Pace = ORtg.l1 + Pace.l1 + X3PAr.l1 + const
Estimate Std. Error t value Pr(>|t|)
ORtg.l1 -0.03026 0.16414 -0.184 0.855
Pace.l1 -0.11711 0.16283 -0.719 0.476
X3PAr.l1 6.57227 14.77673 0.445 0.659
const -0.10617 0.24810 -0.428 0.671
Residual standard error: 1.421 on 39 degrees of freedom
Multiple R-Squared: 0.02201, Adjusted R-squared: -0.05322
F-statistic: 0.2925 on 3 and 39 DF, p-value: 0.8305
Estimation results for equation X3PAr:
======================================
X3PAr = ORtg.l1 + Pace.l1 + X3PAr.l1 + const
Estimate Std. Error t value Pr(>|t|)
ORtg.l1 -0.0008257 0.0018756 -0.440 0.6622
Pace.l1 0.0011681 0.0018606 0.628 0.5338
X3PAr.l1 0.1654261 0.1688517 0.980 0.3333
const 0.0080410 0.0028350 2.836 0.0072 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01623 on 39 degrees of freedom
Multiple R-Squared: 0.02972, Adjusted R-squared: -0.04492
F-statistic: 0.3982 on 3 and 39 DF, p-value: 0.7551
Covariance matrix of residuals:
ORtg Pace X3PAr
ORtg 1.818853 0.407221 0.0052772
Pace 0.407221 2.018000 -0.0020829
X3PAr 0.005277 -0.002083 0.0002635
Correlation matrix of residuals:
ORtg Pace X3PAr
ORtg 1.0000 0.21255 0.24106
Pace 0.2126 1.00000 -0.09033
X3PAr 0.2411 -0.09033 1.00000
# Plot forecasts
for (vname in fc_var_names) {
  plot(fc_var_final, names = vname)
}
Granger Causality Tests:
Code
# Granger causality tests
var_names <- names(final_var$varresult)
tpar_name <- var_names[grep("3PAr|PAr", var_names, ignore.case = TRUE)]
if (length(tpar_name) == 0) tpar_name <- var_names[3]

granger_3par_ortg <- causality(final_var, cause = tpar_name)$Granger
granger_pace <- causality(final_var, cause = "Pace")$Granger
granger_ortg <- causality(final_var, cause = "ORtg")$Granger

cat("=== Granger Causality Tests ===\n\n")
=== Granger Causality Tests ===
Code
cat("3PAr → {ORtg, Pace}: F =", round(granger_3par_ortg$statistic, 3), ", p =", round(granger_3par_ortg$p.value, 4), "\n")
3PAr → {ORtg, Pace}: F = 2.158 , p = 0.1202
Code
cat("Pace → {ORtg, 3PAr}: F =", round(granger_pace$statistic, 3), ", p =", round(granger_pace$p.value, 4), "\n")
Pace → {ORtg, 3PAr}: F = 0.717 , p = 0.4903
Code
cat("ORtg → {Pace, 3PAr}: F =", round(granger_ortg$statistic, 3), ", p =", round(granger_ortg$p.value, 4), "\n")
ORtg → {Pace, 3PAr}: F = 0.122 , p = 0.8851
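Each F statistic above tests, in the standard Granger setup, whether the lagged values of the candidate cause can be dropped from the other equations of the VAR. For a candidate cause indexed by \(i\) in a VAR(1), the null is

\[
H_0:\; A_1^{(j,i)} = 0 \ \text{for all equations } j \neq i,
\]

so a large F (small p) would indicate that the candidate’s history improves prediction of the other variables. Here all three p-values exceed 0.10, so no variable Granger-causes the others at conventional significance levels.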
Granger causality tests probe the temporal ordering of the analytics revolution by determining whether one variable’s past values help predict another’s future changes: did the rise in three-point shooting precede changes in offensive efficiency and pace, or did successful offenses drive teams to adopt more three-pointers? In these league-average data, none of the three tests reaches conventional significance, so the evidence for a one-directional lead-lag structure is weak.
Impulse response functions trace how a one-time shock to one variable cascades through the system, showing whether innovations in one aspect of basketball strategy have lasting effects on others or whether defensive adjustments eventually neutralize advantages. Together, these tools suggest that while the analytics revolution transformed the NBA, its variables co-evolved rather than following a single causal chain, reflecting the complex adaptive nature of elite competition where strategic innovations provoke counter-responses.
References
1.
Kubatko, J., Oliver, D., Pelton, K. & Rosenbaum, D. T. A starting point for analyzing basketball statistics. Journal of Quantitative Analysis in Sports 3 (2007).
2.
Skinner, B. The problem of shot selection in basketball. PLoS ONE 7, e30776 (2012).
3.
Franks, A., Miller, A., Bornn, L. & Goldsberry, K. Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics 9 (2015).
4.
Sliz, B. A. An investigation of three-point shooting through an analysis of NBA player tracking data. arXiv preprint arXiv:1703.07030 (2017).
5.
Poropudas, J. & Halme, T. Dean Oliver’s four factors revisited. arXiv preprint arXiv:2305.13032 (2023).